Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
Dong, Ximing, Wang, Shaowei, Lin, Dayi, Hassan, Ahmed E.
Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge, but most of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative approach to evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that first selects representative and diverse samples using semantic clustering and boundary analysis, then iteratively refines the selection with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
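The core loop is easy to picture in code. The following sketch is a hypothetical rendering of the two-stage idea, assuming sentence embeddings for the candidate pool and a matrix of per-prompt scores gathered during optimization; all function names, thresholds, and heuristics are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of a two-stage evaluation-data selector in the
# spirit of IPOMP; names and thresholds are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def select_initial(embeddings, k_clusters=10, per_cluster=2):
    """Stage 1: cluster sample embeddings, then keep the point closest to
    each centroid (representative) and the farthest points (boundary)."""
    km = KMeans(n_clusters=k_clusters, n_init=10).fit(embeddings)
    selected = []
    for c in range(k_clusters):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        selected += list(idx[np.argsort(d)[:1]])                   # representative
        selected += list(idx[np.argsort(d)[-(per_cluster - 1):]])  # boundary
    return sorted(set(selected))

def refine(selected, pool, perf, corr_threshold=0.95):
    """Stage 2: if two selected samples score near-identically across the
    prompts evaluated so far (rows of `perf`), one is redundant and is
    swapped for an unused pool sample."""
    scores = perf[:, selected]                  # prompts x selected samples
    corr = np.corrcoef(scores, rowvar=False)
    unused = [i for i in pool if i not in selected]
    for a in range(len(selected)):
        for b in range(a + 1, len(selected)):
            if corr[a, b] > corr_threshold and unused:
                selected[b] = unused.pop()      # replace the redundant sample
                # (a fuller implementation would recompute correlations)
    return selected
```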
- North America > Canada > Manitoba (0.04)
- Asia > Singapore (0.04)
Voice Impression Control in Zero-Shot TTS
Fujita, Kenichi, Horiguchi, Shota, Ijima, Yusuke
Para-/non-linguistic information in speech is pivotal in shaping listeners' impressions. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
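To make the mechanism concrete, here is a minimal sketch of how an LLM could map a free-text impression description to such a low-dimensional vector; the axes, prompt wording, and JSON format are assumptions for illustration, not the paper's specification.

```python
# Illustrative sketch only: the impression-pair axes and the prompt are
# assumptions, not the paper's actual specification.
import json

IMPRESSION_PAIRS = ["dark-bright", "calm-energetic", "cold-warm"]  # assumed axes

def impression_prompt(description: str) -> str:
    return (
        "Rate the desired voice on each axis from -1 (first adjective) "
        f"to +1 (second adjective): {IMPRESSION_PAIRS}.\n"
        f"Description: {description}\n"
        'Answer as JSON, e.g. {"dark-bright": 0.7, ...}'
    )

def parse_impression_vector(llm_output: str) -> list[float]:
    scores = json.loads(llm_output)
    return [float(scores[p]) for p in IMPRESSION_PAIRS]

# The resulting low-dimensional vector would then condition the zero-shot
# TTS model alongside the speaker embedding, e.g.:
#   audio = tts.synthesize(text, speaker_emb, impression_vec)
```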
Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis
Sponholz, Jakob, Weilinghoff, Andreas, Schopf, Juliane
In qualitative research, data transcription is often labor-intensive and time-consuming. To expedite this process, a workflow utilizing artificial intelligence (AI) was developed. This workflow not only enhances transcription speed but also addresses the issue that AI-generated transcripts often lack compatibility with standard content analysis software. Within this workflow, automatic speech recognition is employed to create initial transcripts from audio recordings, which are then formatted to be compatible with content analysis software such as ATLAS.ti or MAXQDA. Empirical data from a study of 12 interviews suggest that this workflow can reduce transcription time by up to 46.2%. Furthermore, because it builds on widely used standard software, the process is suitable for both students and researchers and adaptable to a variety of learning, teaching, and research environments. It is also particularly beneficial for non-native speakers. In addition, the workflow is GDPR-compliant and facilitates local, offline transcript generation, which is crucial when dealing with sensitive data.
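As an illustration of the ASR step, the sketch below uses the open-source whisper package to transcribe locally (keeping data offline, in line with the GDPR requirement) and writes timestamped plain text that content analysis tools can import; the paper's exact toolchain and output format may differ.

```python
# A minimal local-transcription sketch in the spirit of the workflow; one
# plausible instantiation, not the paper's exact pipeline.
import whisper  # pip install openai-whisper

def transcribe_for_caqdas(audio_path: str, out_path: str) -> None:
    model = whisper.load_model("medium")    # runs locally, nothing uploaded
    result = model.transcribe(audio_path)
    with open(out_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            # Timestamped plain-text lines import cleanly into tools such
            # as ATLAS.ti or MAXQDA for manual correction and coding.
            start = int(seg["start"])
            f.write(f"[{start // 60:02d}:{start % 60:02d}] {seg['text'].strip()}\n")

transcribe_for_caqdas("interview_01.wav", "interview_01.txt")
```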
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > Scotland (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (2 more...)
- Workflow (1.00)
- Research Report > Experimental Study (0.34)
- Information Technology > Security & Privacy (0.87)
- Law (0.86)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.90)
A State-Space Model for Decoding Auditory Attentional Modulation from MEG in a Competing-Speaker Environment
Akram, Sahar, Simon, Jonathan Z., Shamma, Shihab A., Babadi, Behtash
Humans are able to segregate auditory objects in a complex acoustic scene, through an interplay of bottom-up feature extraction and top-down selective attention in the brain. The detailed mechanism underlying this process is largely unknown and the ability to mimic this procedure is an important problem in artificial intelligence and computational neuroscience. We consider the problem of decoding the attentional state of a listener in a competing-speaker environment from magnetoencephalographic (MEG) recordings from the human brain. We develop a behaviorally inspired state-space model to account for the modulation of the MEG with respect to attentional state of the listener. We construct a decoder based on the maximum a posteriori (MAP) estimate of the state parameters via the Expectation-Maximization (EM) algorithm. The resulting decoder is able to track the attentional modulation of the listener with multi-second resolution using only the envelopes of the two speech streams as covariates.
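A greatly simplified stand-in for this decoder can convey the idea: the sketch below smooths a per-window correlation feature with a two-state hidden Markov model via forward-backward, whereas the paper fits a richer state-space model with EM; all parameters here are assumed.

```python
# Minimal illustrative decoder, NOT the paper's model: a two-state HMM over
# the attentional state (speaker 1 vs speaker 2), observing the per-window
# difference in correlation between the MEG signal and each speech envelope.
import numpy as np

def smooth_attention(feat, p_stay=0.95, sigma=1.0):
    """feat[t] = corr(MEG, env1) - corr(MEG, env2) per window.
    Returns P(attending speaker 1) per window via forward-backward."""
    feat = np.asarray(feat, dtype=float)
    T = len(feat)
    trans = np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
    means = np.array([+1.0, -1.0])          # assumed emission means per state
    lik = np.exp(-(feat[:, None] - means) ** 2 / (2 * sigma**2))
    alpha = np.zeros((T, 2)); beta = np.ones((T, 2))
    alpha[0] = 0.5 * lik[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):                   # forward pass (normalized)
        alpha[t] = lik[t] * (alpha[t - 1] @ trans); alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):          # backward pass (normalized)
        beta[t] = trans @ (lik[t + 1] * beta[t + 1]); beta[t] /= beta[t].sum()
    post = alpha * beta
    return post[:, 0] / post.sum(axis=1)
```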
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.48)
Extracting triples from dialogues for conversational social agents
Vossen, Piek, Santamaría, Selene Báez, Bajčetić, Lenka, Belluci, Thomas
Obtaining an explicit understanding of communication within a Hybrid Intelligence collaboration is essential to create controllable and transparent agents. In this paper, we describe a number of Natural Language Understanding models that extract explicit symbolic triples from social conversation. Triple extraction has mostly been developed and tested for Knowledge Base Completion, using Wikipedia text and data for training and testing. However, social conversation is very different as a genre, in which interlocutors exchange information in sequences of utterances that involve statements, questions, and answers. Phenomena such as co-reference, ellipsis, coordination, and implicit and explicit negation or confirmation are more prominent in conversation than in Wikipedia text. We therefore describe an attempt to fill this gap by releasing data sets for training and testing triple extraction from social conversation. We also created five triple extraction models and tested them on our evaluation data. The highest precision is 51.14 for complete triples and 69.32 for triple elements when tested on single utterances. However, scores for conversational triples that span multiple turns are much lower, showing that extracting knowledge from true conversational data is much more challenging.
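For intuition about the single-utterance case, a bare-bones dependency-based subject-verb-object extractor looks like the sketch below; the paper's five models and their handling of cross-turn phenomena (co-reference, ellipsis, negation) go well beyond this.

```python
# A deliberately simple dependency-based triple extractor for single
# utterances, illustrating the task rather than the paper's models.
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def extract_triples(utterance: str):
    doc = nlp(utterance)
    triples = []
    for tok in doc:
        if tok.pos_ == "VERB":
            subj = [c for c in tok.lefts if c.dep_ in ("nsubj", "nsubjpass")]
            obj = [c for c in tok.rights if c.dep_ in ("dobj", "attr")]
            if subj and obj:
                triples.append((subj[0].text, tok.lemma_, obj[0].text))
    return triples

print(extract_triples("My sister owns two cats."))
# [('sister', 'own', 'cats')]
```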
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- North America > United States > New York (0.04)
- (5 more...)
Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion
Hassan, Syed Zohaib, Lison, Pierre, Halvorsen, Pål
Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech, albeit with a slight reduction in intelligibility.
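Step (1) can be sketched with the Hugging Face peft library as follows; the base model, target modules, and hyperparameters are illustrative assumptions rather than the paper's reported settings.

```python
# Sketch of LoRA fine-tuning for disfluency insertion; settings below are
# assumptions, not the paper's reported configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train

# Training pairs would map fluent utterances to disfluent rewrites, e.g.
#   "I think we should go"  ->  "I uh, I think we should, you know, go"
# after which a disfluency-capable TTS model renders the rewritten text.
```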
- Europe > Norway > Eastern Norway > Oslo (0.05)
- North America > United States > Mississippi > Mississippi County > Mississippi State (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report > Experimental Study (0.94)
- Questionnaire & Opinion Survey (0.90)
Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization
Thulke, David, Gao, Yingbo, Jalota, Rricha, Dugast, Christian, Ney, Hermann
This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs). Our approach involves initial experiments with prompting existing LLMs to generate summaries of telephone conversations, followed by the creation of a tailored synthetic training dataset utilizing stronger frontier models. We place special focus on the diversity of the generated data and on the ability to control the length of the generated summaries to meet various use-case specific requirements. The effectiveness of our method is evaluated using two state-of-the-art LLM-as-a-judge-based evaluation techniques to ensure the quality and relevance of the summaries. Our results show that our fine-tuned Llama-2-7B-based summarization model performs on par with GPT-4 in terms of factual accuracy, completeness, and conciseness. Our findings demonstrate the potential for quickly bootstrapping a practical and efficient call summarization system.
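One plausible way to realize the length control at prompting time is sketched below; the prompt wording and control scheme are assumptions, not the paper's exact setup.

```python
# Illustrative length-controlled prompting; wording and buckets are assumed.
def build_summary_prompt(transcript: str, target_words: int) -> str:
    return (
        f"Summarize the following telephone call in at most {target_words} "
        "words. Cover the caller's request, any decisions made, and agreed "
        "next steps.\n\n"
        f"Call transcript:\n{transcript}\n\nSummary:"
    )

# For fine-tuning, the same control signal can be baked into the synthetic
# training data, e.g. prefixing each example with "Length: short|medium|long"
# so the model learns to respect the requested budget.
prompt = build_summary_prompt("Agent: Hello... Caller: Hi, I'd like to...", 50)
```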
- Asia > Thailand > Bangkok > Bangkok (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Media > Film (0.93)
- Leisure & Entertainment (0.68)
KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark
Jang, Seongbo, Lee, Seonghyeon, Yu, Hwanjo
As language models are often deployed as chatbot assistants, it becomes a virtue for models to engage in conversations in a user's first language. While these models are trained on a wide range of languages, a comprehensive evaluation of their proficiency in low-resource languages such as Korean has been lacking. In this work, we introduce KoDialogBench, a benchmark designed to assess language models' conversational capabilities in Korean. To this end, we collect native Korean dialogues on daily topics from public sources, or translate dialogues from other languages. We then structure these conversations into diverse test datasets, spanning from dialogue comprehension to response selection tasks. Leveraging the proposed benchmark, we conduct extensive evaluations and analyses of various language models to measure a foundational understanding of Korean dialogues. Experimental results indicate that there exists significant room for improvement in models' conversation skills. Furthermore, our in-depth comparisons across different language models highlight the effectiveness of recent training techniques in enhancing conversational proficiency. We anticipate that KoDialogBench will promote progress toward conversation-aware Korean language models.
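Schematically, a response-selection task of the kind included in such benchmarks can be scored as in the sketch below; the field names and scoring interface are generic assumptions, not KoDialogBench's actual schema.

```python
# Generic multiple-choice response-selection evaluation loop; the data
# schema and scoring function are assumptions for illustration.
def evaluate_response_selection(examples, score_fn):
    """examples: dicts with 'context', 'candidates', 'label' (gold index).
    score_fn(context, candidate) -> model score, e.g. candidate log-prob."""
    correct = 0
    for ex in examples:
        scores = [score_fn(ex["context"], c) for c in ex["candidates"]]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == ex["label"])
    return correct / len(examples)
```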
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (18 more...)
Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models
Sicilia, Anthony, Kim, Hyunwoo, Chandu, Khyathi Raghavi, Alikhani, Malihe, Hessel, Jack
Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing "conversation forecasting" task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores, and directly, using tokens) and propose fine-tuning strategies to improve calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size.
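The two uncertainty representations can be sketched concretely, together with a Brier score for calibration; the prompt parsing and answer tokens below are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch of the two outcome-uncertainty representations, plus a Brier score;
# parsing and answer tokens are assumptions for illustration.
import math, re

def prob_from_tokens(llm_text: str) -> float:
    """'Direct' representation: the model verbalizes a probability,
    e.g. 'The deal will close with 70% probability.'"""
    m = re.search(r"(\d{1,3})\s*%", llm_text)
    return min(max(int(m.group(1)) / 100, 0.0), 1.0) if m else 0.5

def prob_from_scores(logprob_yes: float, logprob_no: float) -> float:
    """'Internal' representation: renormalize the log-probabilities the
    model assigns to 'Yes'/'No' answers about the outcome."""
    py, pn = math.exp(logprob_yes), math.exp(logprob_no)
    return py / (py + pn)

def brier(probs, outcomes):  # lower means better calibration
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```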
- North America > United States > Alaska (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (18 more...)
- Government (1.00)
- Media > News (0.46)
- Law > Statutes (0.46)
- Education > Educational Setting > K-12 Education (0.46)